SOCRview: a case study of RESTful service development for publishing

John Cooper

Senior Content Systems Analyst

SAGE Publications, London

Copyright © 2017 Sage Publications

Preliminary Proceedings

Balisage: The Markup Conference 2017
August 1 - 4, 2017

Context and Goals

SAGE is an academic publisher whose content is marked up in XML and stored in a Content Management System (CMS) known internally as SOCR (SAGE Online Content Repository). Many types of content are stored in SOCR but this paper will focus on journals.

SOCR is a CMS with typical characteristics: content comes in (ingestion, validation), goes out (reports, searches, delivery) and is stored/archived. At its core SOCR consists of two applications: a front-end that provides user access and also contains a workflow engine; and a back-end XML database.

This paper presents how content goes out: how it is accessed through a service, SOCRview, running on an XML database. Starting from the initial motivations and an XML database, services, URIs and a REST framework are added to the mix in turn.

Motivation

Content should not be hidden away, accessible only through expert database specialists. Storing content in a database system has advantages of consistency, scalability and security, but accessing the content often requires special knowledge and privileges. Wouldn't it be nice if typical access could be provided in a simple and intuitive way using, say, HTTP, and more advanced access made easier with a standardized configuration layer (ideally in XML)?

Goals

  • Demonstrate the accessibility of content through a simple HTTP interface

  • Design persistent, readable, meaningful and succinct URIs for content and use them to access content

  • Use variations on the core URI, by adding extensions and postfixes, to access different views of the content including metadata, reports and transformations

  • Create a customizable transformation layer to implement complex or non-standard views

Note

From the beginning, browser access was useful and important but the goal was not to create a web application. The goal was to provide simple and intuitive HTTP-URL based access to content that could be used by services, programmers writing ad hoc scripts or a web application. To date, there is no web application, just a very thin XSLT-to-HTML layer.

XML Database Services

To set the stage for what follows it is necessary to understand a little about services written in XQuery. A minimum configuration can consist of specifying a port and a location for XQuery files. The examples below demonstrate a simple content query and how to access the requesting URL.

A service is written like an XQuery program where the input context is all the documents in the database, as if the database was one giant root document and each actual document a child of the root document. This example illustrates a service running on port 8123 that returns an arbitrary article. The following is placed in a file, one-article.xqy:

(/article)[1]

Opening the following URL in a browser will return one article:

http://localhost:8123/one-article.xqy

Returns:
<article article-type="research-article" dtd-version="1.1d1" r:rsuiteId="6536723" xml:lang="EN">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">EPM</journal-id>
  ...
</article>

There was one word excluded from the stated goal of accessing content using URIs: "directly." We want to use the URI directly and not as a unique id in a parameter.

Not like this:

http://localhost:8080/goGoGadget.xqy?uri=/a/b/c

Yes, like this:

http://localhost:8080/a/b/c

This can be achieved by configuring the service to redirect requests. Below is an example of redirecting all requests to the XQuery file simple-service.xqy, which contains the following:

let $url := xdmp:get-original-url()

return
  if ( $url eq '/one-article' ) then
    (/article)[1]
  else
    fn:concat("URL: ",$url)
To return one article:

http://localhost:8080/one-article

Otherwise, just echo the request URL:

http://localhost:8080/a/b/c

Returns:
URL: /a/b/c

The XQuery program above, simple-service.xqy, shows how an HTTP service can interpret a request URL and access content. The next step would be to use regular expressions to match the request URL, isolating the object URI from modifiers that indicate which aspect of the object is to be returned (e.g. the object itself, its metadata or a transformed version of the object).
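As a rough illustration of that next step, the following Python sketch (not the XQuery used by SOCRview) separates an object URI from a trailing modifier with a regular expression. The modifier names are borrowed from the view extensions described later in this paper; the pattern itself and the function name split_request are assumptions made for illustration.

```python
import re

# Assumed set of view modifiers; any other suffix is treated as part of the object URI.
VIEW_MODIFIERS = ("res", "lst", "html", "map")

REQUEST = re.compile(
    r"^(?P<uri>/.*?)"                                        # object URI (shortest match)
    r"(?:\.(?P<view>" + "|".join(VIEW_MODIFIERS) + r"))?$"   # optional trailing modifier
)


def split_request(url):
    """Return (object_uri, view); view is None when no modifier is present."""
    match = REQUEST.match(url)
    if match is None:
        raise ValueError(f"not a recognised request URL: {url!r}")
    return match.group("uri"), match.group("view")


print(split_request("/journal/AJS.res"))
print(split_request("/journal/AJS/44/9/10.1177_0363546515618372.xml"))
```

Because the modifier list is closed, an object extension such as .xml stays part of the object URI rather than being mistaken for a view.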

URI - Initial Analysis

The catalyst for introducing URI design for journals came from a 2012 MarkLogic Users Group London (MUGL) meeting where Jeni Tennison presented her technical approach and architecture for UK legislation [MUGL2012]; in particular, the utility of meaningful persistent URIs and how modifiers could be applied to view different aspects of an object. This presentation led us to combine an analysis of our journal content with examples of how other systems provided URI-based access to journal content, producing an initial URI design.

SOCR already had RESTful access to content [RSuiteAPI]:

http://localhost:8080/rsuite/rest/v1/content/38024?skey=1345

Each document has a unique positive integer id, which works well to identify content when a CMS is generalized to store anything. But the id is not meaningful and it is not persistent; if you delete a document and add it again it gets a new, different id.

Examples from other publishers

Legacy academic publishing, organized by volume and issue and published in print as well as online, lends itself to hierarchical URIs. A hierarchical URI can be seen on HighWire in the example below: the article on page 395 of journal aas, volume 25, issue 4. This is meaningful, but not at the article level, as the page number is less meaningful than an article DOI and is also tied to a particular PDF rendering.

http://aas.sagepub.com/content/25/4/395

All SAGE journal articles are identified with a Digital Object Identifier (DOI) [DOI]. HighWire used DOIs as an alternative way of directly accessing articles. The following is the HighWire URL for accessing journal aas, article DOI 10.1177/009539979402500401.

http://aas.sagepub.com/lookup/doi/10.1177/009539979402500401

Some online-only journals use the DOI as the primary identifier to access content.

  • BioMed Central, "Big Data Analytics" DOI 10.1186/s41044-017-0021-9

    https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-017-0021-9

  • Public Library of Science, "PLOS ONE" DOI 10.1371/journal.pone.0127502

    http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0127502

Core URI design

Given that articles can be uniquely identified through a DOI, why not stop there? Why include journal, volume and issue identifiers?

  • In SOCR the DOI is not unique because we have parallel versions. For example, when an article first enters the system it is in the form of an Accepted Manuscript which has not been assigned a volume or issue. There is a requirement to keep the Accepted Manuscript separately and not as a version of a single article.

  • There is structure in the database for navigating journal content -- journal, volume, issue and article container objects -- and these must also have URI.

  • There is content at the issue level (e.g. cover images).

  • It is useful to apply metadata at the journal, volume, issue and article levels (e.g. if you want to act on an issue as a whole).

  • Finally, even if it were not necessary to have identifiers above the article level, it is meaningful to be able to know where an article belongs based on its URI.

So, given the examples above a reasonable base URI for a journal article might be:

/AAS/25/4/10.1177/009539979402500401

Except we want to use a normalized DOI where forward slash (and most non-alphanumeric characters) is replaced by underscore:

/AAS/25/4/10.1177_009539979402500401

Normalizing was a natural step since the input files are named using the DOI. Later, when creating SOCRview, using the normalized DOI will simplify the regular expressions that parse the URI. Also, we have DOI like this:

10.1597/1545-1569(1995)032<0206:pfaoat>2.3.co;2
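The normalization step can be sketched in Python as follows. The paper only states that forward slash and most non-alphanumeric characters are replaced by underscore; the exact set of characters kept here (letters, digits, period and hyphen) and the function name normalize_doi are assumptions.

```python
import re


def normalize_doi(doi):
    """Replace '/' and other disallowed characters in a DOI with '_'.

    Assumption: letters, digits, '.' and '-' are kept unchanged; the real
    SOCR rule may keep a slightly different set of characters.
    """
    return re.sub(r"[^A-Za-z0-9.-]", "_", doi)


print(normalize_doi("10.1177/009539979402500401"))  # 10.1177_009539979402500401
```

Under the same assumption, a DOI containing parentheses, angle brackets, colons or semicolons, such as the one above, becomes a string that can be used safely as a file name and as a single URI path segment.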

Although this paper is focused on journals, there are other types of content in SOCR and in order to unambiguously interpret the URI--especially in a service as described above that will want to respond differently based on matching URI with regular expressions--a namespace-like prefix will indicate that this is a journal URI:

/journal/AAS/25/4/10.1177_009539979402500401

URI for all objects

A journal consists of one or more volumes, a volume of one or more issues, an issue of one or more articles and an article of at least the article XML with optional PDF and images. The listing below, though not complete, shows the hierarchical structure of content stored in SOCR as a nested series of containers and objects. Each node is an XML document: a container contains references to its children; graphics and PDF contain references to files on a file system.

  • journal AJS /journal/AJS

    • volume 44 /journal/AJS/44

      • issue 9 /journal/AJS/44/9

        • issue cover image /journal/AJS/44/9/AJS_44_9_cover.tif

        • article 10.1177_0363546515618372 /journal/AJS/44/9/10.1177_0363546515618372

          • article XML /journal/AJS/44/9/10.1177_0363546515618372.xml

          • article PDF /journal/AJS/44/9/10.1177_0363546515618372.pdf

          • article graphic /journal/AJS/44/9/10.1177_0363546515618372-fig1.tif

URI for objects inside an article container: why exclude the article container level from the URI?

/journal/AJS/44/9/10.1177_0363546515618372.xml

rather than

/journal/AJS/44/9/10.1177_0363546515618372/10.1177_0363546515618372.xml

One of the goals was to have succinct URI without unnecessary repetition. A restricted naming convention, requiring all objects belonging to an article to start with the normalized DOI, allowed the former approach, which avoids repeating the DOI. If this restriction were not present then the latter approach would have been used:

/journal/AJS/44/9/10.1177_0363546515618372/foobar.xml
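The rule can be expressed as a small Python sketch. The function name object_uri and its parameters are hypothetical, and the fallback branch illustrates the alternative the paper says was not needed, so it does not describe SOCRview behaviour.

```python
def object_uri(tla, volume, issue, normalized_doi, object_name):
    """Build the URI of an object stored inside an article container."""
    issue_uri = f"/journal/{tla}/{volume}/{issue}"
    if object_name.startswith(normalized_doi):
        # Naming convention holds: the article level adds nothing, so leave it out.
        return f"{issue_uri}/{object_name}"
    # Hypothetical fallback: keep the article container level in the URI.
    return f"{issue_uri}/{normalized_doi}/{object_name}"


print(object_uri("AJS", "44", "9", "10.1177_0363546515618372",
                 "10.1177_0363546515618372-fig1.tif"))
```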

The initial set of URI implemented in the first version of SOCRview also had modifiers on the core URI to provide transformations and different views (e.g. /journal/AJS/44/9.zip would return a zip file of all content in the issue); this will be explored later when discussing the current version of SOCRview.

SOCRview Proof Of Concept (POC)

Full details of the POC would show little, but two aspects have a bearing on what follows. Most importantly, it successfully accomplished most of the stated goals and demonstrated what was possible in a way that words by themselves did not. However, the approach taken to matching and parsing URI was not ideal.

POC URI processing

The approach taken to processing the URI consisted of tokenizing the URI and then making decisions based on its decomposed parts (journal, volume, issue, article, extension, etc.). The service worked and was performant but it was difficult to understand and maintain; each additional endpoint increased the complexity of the code. Also, the approach was contrary to the spirit of having persistent URI for objects. There is a different philosophy in play when matching a URI against a given pattern but subsequently treating it as a single identifier.

Production SOCRview using RXQ

An alternative approach to matching and parsing URIs presented itself at another MUGL meeting, where Jim Fuller presented his RESTXQ library, which uses XQuery function annotations to expose RESTful services in MarkLogic [MUGL2014]. Jim's RESTXQ library is based on Adam Retter's RESTXQ draft [RESTXQspec], presented at XML Prague 2012 [XMLPrague2012].

The RXQ library makes use of XQuery annotations [XQuery3] on function declarations. Every entry point (endpoint) of the service has a function declared as in the example below. This example shows the default behaviour when no URI is provided: the root URI '/' returns a static table of contents XML document:

declare
%rxq:produces('text/xml')
%rxq:GET
%rxq:path('/')
function toc() { static:toc() };

The above example shows three annotations used in SOCRview; this paper will focus on the rxq:path annotation containing a regular expression string.
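For readers more familiar with Python than XQuery, annotation-driven routing can be approximated with decorators. This is an analogy only, not how RXQ is implemented; the names ROUTES, path and dispatch, and the returned XML strings, are invented for the sketch.

```python
import re

ROUTES = []  # (method, compiled path pattern, handler), in registration order


def path(pattern, method="GET"):
    """Register the decorated function as the handler for a URL path pattern."""
    def register(handler):
        ROUTES.append((method, re.compile(pattern), handler))
        return handler
    return register


@path("/")
def toc():
    # Stand-in for the static table of contents document.
    return "<toc/>"


@path("/journal/([A-Z]+)")
def journal(tla):
    # Each capture group in the pattern becomes a positional argument.
    return f"<journal tla='{tla}'/>"


def dispatch(method, url):
    """Call the first handler whose method and pattern match the whole URL."""
    for route_method, pattern, handler in ROUTES:
        match = pattern.fullmatch(url)
        if route_method == method and match:
            return handler(*match.groups())
    raise LookupError(f"no endpoint for {method} {url}")


print(dispatch("GET", "/journal/AJS"))
```

The point of the design is that each endpoint is declared next to the pattern it serves, so adding an endpoint does not touch the matching code.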

Before showing the %rxq:path annotations that would match URI, as proposed above, it is necessary to explain an enhancement made to RXQ. As ubiquitous and powerful as regular expressions are, they can be cryptic--especially for complex patterns--and difficult to understand or modify; moreover, a programmer should be able to look at a regular expression in an annotation and understand the URI it is intended to match. An abstraction layer was added to support symbolic patterns, or pattern variables. Pattern variables are defined in a map:

let $m := map:map()
let $_ := map:put($m,'$doi','(10\.\d{4,5}_[^/]+)')
let $_ := map:put($m,'$tla','([A-Z]*)')
let $_ := map:put($m,'$vol','([^/]+)')
let $_ := map:put($m,'$iss','([^/]+)')
let $_ := map:put($m,'$obj','([^/]+\.$objext)')
let $_ := map:put($m,'$objext','([a-z]+)')
...

Changes were made to the RXQ library to resolve these variables. Finally, the variables are used in a function declaration. The following function will match any of the object URI above and return the object:
declare
  %rxq:GET
  %rxq:path('(/journal(/$tla(/$vol(/$iss)?)?)?(/$obj|/$doi)?)($filter)?')
function jrnlObject(
  $socrUri, $_1, $_tla, $_2, $_vol, $_3, $_iss, $_4, $_obj, $_objext, $_doi, $_5, $filter
)
{ uf:applyFilters($filter,_getObject($socrUri)) };

In the above function declaration:

  • $tla - Three Letter Acronym - a journal code (e.g. AJS = "The American Journal of Sports Medicine")

  • $vol - volume

  • $iss - issue

  • $obj - an object name - a file name

  • $doi - DOI

  • $filter - to be explained later

Parentheses in regular expressions are used to isolate sub-expressions and capture text. These capture groups [Regular Expressions] are assigned to the corresponding variables in the declared function. In the example most of the captured values are not used; only the URI and filter are used. A future enhancement could implement non-capture groups so that only required capture groups are assigned to variables. A future enhancement might also disallow capture groups inside pattern variables so that what is captured can be understood just from reading the %rxq:path.
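How pattern variables expand, and how the resulting capture groups line up with the function parameters, can be sketched in Python. This is not the RXQ implementation. The definition of $filter is not shown in the paper, so the one below is hypothetical, as is the choice to require the pattern to match the whole URL.

```python
import re

# Pattern variables copied from the map above, plus an assumed definition of $filter.
PATTERN_VARS = {
    "$doi": r"(10\.\d{4,5}_[^/]+)",
    "$tla": r"([A-Z]*)",
    "$vol": r"([^/]+)",
    "$iss": r"([^/]+)",
    "$obj": r"([^/]+\.$objext)",
    "$objext": r"([a-z]+)",
    "$filter": r"(/__filter/.+)",  # hypothetical: the real definition is not shown
}

VARIABLE = re.compile(r"\$[a-z]+")


def expand(pattern, variables=PATTERN_VARS, max_passes=10):
    """Substitute pattern variables until none remain ($obj itself uses $objext)."""
    for _ in range(max_passes):
        expanded = VARIABLE.sub(lambda m: variables[m.group(0)], pattern)
        if expanded == pattern:
            return expanded
        pattern = expanded
    raise ValueError("pattern variables are nested too deeply or are circular")


PATH = "(/journal(/$tla(/$vol(/$iss)?)?)?(/$obj|/$doi)?)($filter)?"

# One name per capture group, in the order of the function parameters above.
NAMES = ["socrUri", "_1", "_tla", "_2", "_vol", "_3", "_iss",
         "_4", "_obj", "_objext", "_doi", "_5", "filter"]


def match_path(url):
    """Return a dict of capture group values, or None if the URL does not match."""
    match = re.fullmatch(expand(PATH), url)
    return dict(zip(NAMES, match.groups())) if match else None


groups = match_path("/journal/AJS/44/9/10.1177_0363546515618372.xml")
print(groups["socrUri"], groups["_tla"], groups["_objext"])
```

The expanded expression contains thirteen capture groups, which is why the function declaration needs thirteen parameters even though only two of them are used.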

Using RXQ allows for better organization and maintenance of service endpoints. Functions that match URI with complex patterns can be created that act upon the URI, applying any modifiers.

Views

So far all examples of URI and corresponding endpoints have corresponded to objects (container, non-XML or XML nodes); views are anything that can be derived from a URI and some modifying suffixes. Here are some examples of views:

  • metadata associated with object

  • zip file containing all content in an issue

  • the most recent cover image for a journal

  • transformed XML

Standard extension based views

Simple file extensions (e.g. .html) are used to show structural aspects of an object. Structural aspects mean either

  1. resolving the internal integer-based linking to SOCR URI to allow for simple rendering and navigation in a browser

  2. resources, metadata or the raw integer based linking from container to child – mostly used by administrators or developers

To illustrate typical structural views, below are 4 views for a container node:

  • resource metadata -- every object (container, XML or non-XML) has a corresponding resource / metadata document that can be accessed by appending a .res extension

    /journal/AJS.res
  • raw XML document container -- contains numerical ids pointing to its children

    /journal/AJS
  • XML document listing the children where the children are referenced by URI

    /journal/AJS.lst
  • HTML view child list

    /journal/AJS.html

    This view converts the document list above into HTML by adding a processing instruction that will run an XSLT 1.0 program, pretty.xsl, in a browser:

    <?xml-stylesheet type="text/xsl" href="/xslt/pretty.xsl"?>
  • Map - this is an XML representation of the database structure starting at the given URI:

    /journal/AAN/25/3/10.1177_0218492315603212.map
    <container name="10.1177_0218492315603212" type="rs_ca" socrUri="/journal/AAN/25/3/10.1177_0218492315603212" id="167632462">
      <title>10.1177_0218492315603212</title>
      <meta name="tla">AAN</meta>
      <meta name="volume">25</meta>
      <meta name="issue">3</meta>
      <meta name="year">2017</meta>
      <meta name="doi">10.1177/0218492315603212</meta>
      <meta name="articleType">case-report</meta>
      <object name="10.1177_0218492315603212-fig2.tif" type="nonxml" socrUri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif" id="167632519">
        <title>10.1177_0218492315603212-fig2.tif</title>
        <meta name="tla">AAN</meta>
        ...
        <meta name="md5sum">4882ffc15e361c9bd5737ba1c5855372</meta>
        <created>2017-03-21T16:04:59.004Z</created>
        <modified>2017-03-21T16:04:59.243Z</modified>
      </object>
      <object name="10.1177_0218492315603212.xml" type="article" socrUri="/journal/AAN/25/3/10.1177_0218492315603212.xml" id="167632474">
        <title>Angina in left main coronary artery occlusion by pulmonary artery aneurysm</title>
        <meta name="tla">AAN</meta>
      ...
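Because the map is plain XML, a script can consume it with standard tools. The Python sketch below parses an abbreviated copy of the map shown above (most meta elements and the timestamps are omitted) and lists the objects it contains; the function name list_objects is an assumption.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the .map output shown above.
MAP_XML = """
<container name="10.1177_0218492315603212" type="rs_ca"
           socrUri="/journal/AAN/25/3/10.1177_0218492315603212" id="167632462">
  <title>10.1177_0218492315603212</title>
  <meta name="doi">10.1177/0218492315603212</meta>
  <object name="10.1177_0218492315603212-fig2.tif" type="nonxml"
          socrUri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif" id="167632519">
    <title>10.1177_0218492315603212-fig2.tif</title>
  </object>
  <object name="10.1177_0218492315603212.xml" type="article"
          socrUri="/journal/AAN/25/3/10.1177_0218492315603212.xml" id="167632474">
    <title>Angina in left main coronary artery occlusion by pulmonary artery aneurysm</title>
  </object>
</container>
"""


def list_objects(map_xml):
    """Return (type, socrUri) for every object found anywhere below the map root."""
    root = ET.fromstring(map_xml)
    return [(obj.get("type"), obj.get("socrUri")) for obj in root.iter("object")]


for object_type, uri in list_objects(MAP_XML):
    print(object_type, uri)
```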

Deliveries: packages and report

SOCR has over 100 delivery targets, the vast majority of which are simple: a zip file of some or all of the content of a journal issue. There are also some highly customized deliveries (e.g. Epub). Naturally there are deliveries that fall somewhere in between, and the challenge was to push as many of these as possible onto production, where users need only copy a configuration file, change a few ids, and perhaps add or override XML transformations. But new requirements kept increasing the complexity of the transformations specified in the delivery configuration file: multiple transformations, conditional transformations, etc. XProc was considered but was not a natural fit; XSLT was. Each delivery therefore has two levels of configuration requiring two levels of expertise: an XML delivery configuration file customizable by production users and an XSLT packaging program requiring a developer.

Deliveries are views where the URL consists of an object URI, a delivery identifier and an extension .rpt, .dlvr or .zip. For example, the following creates a zip file of all content belonging to an issue:

/journal/AJS/44/9/localDelivery.zip

The delivery identifier is "localDelivery"; every delivery identifier must resolve to a delivery configuration XML fragment. SOCRview will first look for the configuration in a static variable, for standard system deliveries, and then in an external document that can be customized by users, for bespoke deliveries. localDelivery is a system delivery with the following configuration:

<deliveryConfig id="localDelivery">
  <pkgList type="xslt" uri="/deliver/localDelivery.xsl"/>
</deliveryConfig>
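The two steps just described, splitting a delivery URL and resolving the delivery identifier, might look like this in Python. The regular expression, the dictionaries standing in for the static variable and the external document, and the function names are all assumptions.

```python
import re

DELIVERY_URL = re.compile(
    r"^(?P<uri>/.+)/(?P<delivery>[^/.]+)\.(?P<ext>rpt|dlvr|zip)$"
)

# Stand-ins for the two configuration sources; the contents are illustrative only.
SYSTEM_CONFIGS = {"localDelivery": {"pkgList": "/deliver/localDelivery.xsl"}}
USER_CONFIGS = {}  # would be loaded from the user-editable external document


def parse_delivery(url):
    """Split a delivery URL into (object URI, delivery identifier, extension)."""
    match = DELIVERY_URL.match(url)
    if match is None:
        raise ValueError(f"not a delivery URL: {url!r}")
    return match.group("uri", "delivery", "ext")


def find_config(delivery_id):
    """System deliveries are checked first, then user-defined (bespoke) ones."""
    for source in (SYSTEM_CONFIGS, USER_CONFIGS):
        if delivery_id in source:
            return source[delivery_id]
    raise KeyError(f"unknown delivery: {delivery_id}")


uri, delivery_id, extension = parse_delivery("/journal/AJS/44/9/localDelivery.zip")
print(uri, delivery_id, extension, find_config(delivery_id))
```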

Package delivery

A package delivery assumes the content exists, constructs a map of the content structure rooted at the given URI (see Standard extension based views, above), and runs an XSLT to transform the map into a package specification (XML) that is then interpreted to construct the final package, usually a zip file.

Map -> Packaging XSLT -> Package Specification -> Package

A request URL of

/journal/AJS/44/9/localDelivery.zip
will process a map as listed above through an XSLT:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
  version="2.0">

  <xsl:template match="container">
    <transform type="zip">
       <xsl:apply-templates select=".//object"/>
    </transform>
  </xsl:template>

  <xsl:template match="object[@type ne 'nonxml']">
    <transform name="{util:getObjName(.)}" type="xqyfn" fn="serialize">
      <param name="addHeader"/>
      <param name="removeRsuite"/>
      <object uri="{@socrUri}"/>
    </transform>
  </xsl:template>
  
  <xsl:template match="object[@type eq 'nonxml']">
    <object name="{@name}" uri="{@socrUri}"/>
  </xsl:template>

</xsl:stylesheet>
to create a package specification:
<transform type="zip">
  <object name="10.1177_0218492315603212.pdf" uri="/journal/AAN/25/3/10.1177_0218492315603212.pdf"/>
  <object name="10.1177_0218492315603212-fig3.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig3.tif"/>
  <object name="10.1177_0218492315603212-fig2.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig2.tif"/>
  <transform name="10.1177_0218492315603212.xml" type="xqyfn" fn="serialize">
    <param name="addHeader"/>
    <param name="removeRsuite"/>
    <object uri="/journal/AAN/25/3/10.1177_0218492315603212.xml"/>
  </transform>
  <object name="10.1177_0218492315603212-fig1.tif" uri="/journal/AAN/25/3/10.1177_0218492315603212-fig1.tif"/>
</transform>
which will return a zip file, localDelivery.zip:
Archive:  localDelivery.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    17111  07-09-2017 17:42   10.1177_0218492315603212.xml
  1737310  07-09-2017 17:42   10.1177_0218492315603212-fig1.tif
   303239  07-09-2017 17:42   10.1177_0218492315603212.pdf
  5352824  07-09-2017 17:42   10.1177_0218492315603212-fig2.tif
  1612126  07-09-2017 17:42   10.1177_0218492315603212-fig3.tif
---------                     -------
  9022610                     5 files
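The last step of the pipeline, interpreting a package specification to build the zip file, can be sketched in Python. The specification below is a shortened version of the one above. The fetch and serialize functions are placeholders (the real ones read from the repository and run an XQuery function), and build_package is a hypothetical name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

SPEC = """
<transform type="zip">
  <object name="10.1177_0218492315603212.pdf" uri="/journal/AAN/25/3/10.1177_0218492315603212.pdf"/>
  <transform name="10.1177_0218492315603212.xml" type="xqyfn" fn="serialize">
    <param name="addHeader"/>
    <param name="removeRsuite"/>
    <object uri="/journal/AAN/25/3/10.1177_0218492315603212.xml"/>
  </transform>
</transform>
"""


def fetch(uri):
    """Placeholder for reading an object from the repository."""
    return f"content of {uri}".encode("utf-8")


def serialize(data, params):
    """Placeholder for the 'serialize' transformation; returns the data unchanged."""
    return data


TRANSFORMS = {"serialize": serialize}


def build_package(spec_xml):
    """Interpret a zip package specification and return the zip file as bytes."""
    spec = ET.fromstring(spec_xml)
    if spec.tag != "transform" or spec.get("type") != "zip":
        raise ValueError("expected a <transform type='zip'> root element")
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for entry in spec:
            if entry.tag == "object":
                # Plain object: copied into the package unchanged.
                package.writestr(entry.get("name"), fetch(entry.get("uri")))
            elif entry.tag == "transform":
                # Nested transform: fetch its object, then apply the named function.
                params = [p.get("name") for p in entry.findall("param")]
                data = fetch(entry.find("object").get("uri"))
                transformed = TRANSFORMS[entry.get("fn")](data, params)
                package.writestr(entry.get("name"), transformed)
    return buffer.getvalue()


print(zipfile.ZipFile(io.BytesIO(build_package(SPEC))).namelist())
```

Keeping the specification as data means the packaging XSLT decides what goes into a package, while the interpreter stays the same for every delivery.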

Report Delivery

A SOCRview report simply runs an XQuery function, passing the URI and a report id; there are no other restrictions and the URI does not have to resolve to existing content.

The example below returns a list of all journal issues where the provided DOI is used, excluding the provided URI. If the DOI is unique the list will be empty; if the DOI is not unique the report returns the URI where it is already used.

A request URI

/journal/AAN/25/3/10.1177_0218492315603212/uniqueDoi.rpt

will use the internal delivery configuration

<deliveryConfig id="uniqueDoi">
  <report>
    <function
    fnName="uniqueDoi"
    fnNamespace="http://sagepub.org/socrview/report"
    fnLocation="/modules/report.xqy"/>
  </report>
</deliveryConfig>

to run the following XQuery function

declare function uniqueDoi(
  $socrUri as xs:string
, $refxml as node()
)
{...};

and return the following result

<socrUris/>

indicating that the DOI is unique.
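The body of uniqueDoi is elided in this paper, so the following Python sketch only illustrates the behaviour described: given an article URI, return any other known URI that ends in the same normalized DOI. The list of known URIs, including the duplicate in issue 4, is invented for the example.

```python
def unique_doi(socr_uri, known_article_uris):
    """Return other article URIs that end in the same normalized DOI as socr_uri."""
    doi = socr_uri.rsplit("/", 1)[-1]
    return sorted(
        uri for uri in known_article_uris
        if uri != socr_uri and uri.rsplit("/", 1)[-1] == doi
    )


# Hypothetical repository contents: the second entry reuses the first entry's DOI.
KNOWN = [
    "/journal/AAN/25/3/10.1177_0218492315603212",
    "/journal/AAN/25/4/10.1177_0218492315603212",
    "/journal/AJS/44/9/10.1177_0363546515618372",
]

print(unique_doi("/journal/AJS/44/9/10.1177_0363546515618372", KNOWN))  # []
print(unique_doi("/journal/AAN/25/3/10.1177_0218492315603212", KNOWN))
```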

Filters

The final type of view is a filter: a sequence of one or more XSLT, XQuery or XPath expressions run on the content obtained from a URI or URI view. Multiple filters can be executed, left to right. XSLT or XQuery expressions resolve to program files that form part of the SOCRview code. XPath expressions can be ad hoc and can reference any namespaces or functions declared or visible in the code context where the filter is evaluated. Parameters can be used and will be supplied to every XSLT or XQuery module referenced in the filter; if a parameter is not declared by a module it will simply be ignored.

Multiple XSLT filters

The example below applies two XSLT filters to an XML object.

wrapper-one.xsl

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xlink="http://www.w3.org/1999/xlink"
>

<xsl:template match="*">
  <wrapperOne>
    <xsl:copy-of select="."/>
  </wrapperOne>
</xsl:template>

</xsl:stylesheet>
         

wrapper-id.xsl

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xlink="http://www.w3.org/1999/xlink"
>

<xsl:param name="id" select="'Default'"/>

<xsl:template match="*">
  <xsl:element name="{concat('wrapper',$id)}">
    <xsl:copy-of select="."/>
  </xsl:element>
</xsl:template>

</xsl:stylesheet>

Applying wrapper-one.xsl

/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xslt/wrapper-one.xsl

returns:
<wrapperOne>
  <article article-type="case-report" dtd-version="1.1d1" r:rsuiteId="167632474" xml:lang="en">
    <front>
...

Applying wrapper-one.xsl, then wrapper-id.xsl

/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xslt/wrapper-one.xsl/xslt/wrapper-id.xsl?id=Two&dummy=Null
returns
<wrapperTwo>
  <wrapperOne>
    <article article-type="case-report" dtd-version="1.1d1" r:rsuiteId="167632474" xml:lang="en">
      <front>
...
Notice that the parameter id was applied but the undeclared parameter, dummy, was ignored.
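The left-to-right evaluation and the parameter handling can be imitated in Python. The two functions below stand in for wrapper-one.xsl and wrapper-id.xsl, and the table of declared parameters is written by hand. Only the xslt form of the filter chain is parsed, since an XPath filter may itself contain slashes. All names are assumptions for the sketch.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qsl


def wrapper_one(node):
    """Python stand-in for wrapper-one.xsl: takes no parameters."""
    wrapped = ET.Element("wrapperOne")
    wrapped.append(node)
    return wrapped


def wrapper_id(node, id="Default"):
    """Python stand-in for wrapper-id.xsl: declares one parameter, id."""
    wrapped = ET.Element("wrapper" + id)
    wrapped.append(node)
    return wrapped


# Filter name -> (function, names of the parameters it declares).
FILTERS = {
    "wrapper-one.xsl": (wrapper_one, set()),
    "wrapper-id.xsl": (wrapper_id, {"id"}),
}


def split_request(url):
    """Return (object URI, [(kind, name), ...], parameters) for a filter URL."""
    parts = urlsplit(url)
    object_uri, _, chain = parts.path.partition("/__filter/")
    tokens = chain.split("/") if chain else []
    filters = list(zip(tokens[0::2], tokens[1::2]))
    return object_uri, filters, dict(parse_qsl(parts.query))


def apply_filters(node, filters, params):
    """Apply each filter in order, passing only the parameters it declares."""
    for _kind, name in filters:
        function, declared = FILTERS[name]
        accepted = {key: value for key, value in params.items() if key in declared}
        node = function(node, **accepted)
    return node


URL = ("/journal/AAN/25/3/10.1177_0218492315603212.xml"
       "/__filter/xslt/wrapper-one.xsl/xslt/wrapper-id.xsl?id=Two&dummy=Null")
uri, filters, params = split_request(URL)
result = apply_filters(ET.Element("article"), filters, params)
print(ET.tostring(result, encoding="unicode"))
```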

XPath filters

Example, md5sum for a PDF:

/journal/AAN/25/3/10.1177_0218492315603212.pdf/__filter/xdmp:md5(binary()).xpath 
12130105eaeaf74a21cbe457b8b70bd0

Example, byte count for XML:

/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/xdmp:binary-size(xdmp:unquote(xdmp:quote(.),(),"format-binary")/binary()).xpath
17061

Example, abstract from article XML:

/journal/AAN/25/3/10.1177_0218492315603212.xml/__filter/descendant::abstract.xpath
<abstract>
  <p>A 51-year-old woman with exercise angina and a history of pulmonary artery hypertension ...</p>
  <p>After a multidisciplinary evaluation,...</p>
</abstract>

Summary

The initial motivation of exposing SAGE's journal content through a simplified interface, and the goals of building this interface as an HTTP service using persistent, readable, meaningful and succinct URI, were achieved. The usefulness of the approach has so far mostly been seen in redesigning SOCR as multiple services, but the browser interface has also proven popular and useful for technical users in our publishing systems group--who created and maintain SOCR--and the production group--who use SOCR. There has also been a gradual increase in content accessed through scripts (e.g. data scientists using Python). URI design for journals has met all stated goals, but URI design for non-journal content has been less satisfying because of a tendency to view content based on its form (i.e. markup – e.g. TEI, DocBook, etc.) rather than its function (e.g. a book)--a salutary lesson that the time spent thinking about journal URI design was well spent. The use of RXQ has allowed easy addition of new, non-journal, content types. Having multiple levels of configuration for deliveries has allowed simple new deliveries to be created without the intervention of a developer or administrator, and complex deliveries, requiring development, to be created faster.

Author's keywords for this paper: configuration; transformation; XML database; XQuery; XSLT; XPath; RESTful; RXQ; RESTXQ; MarkLogic; web application; content management; CMS; case study; HTTP; Service Oriented Architecture; URI; code injection; single source publishing; RSuite; Regular Expressions; regex