Challenges and Potential of Local Loading of XML Ebooks
Balisage: The Markup Conference 2011
August 2 - 5, 2011
This paper will describe the local loading process of XML book collections on the Scholars
Portal Ebook platform. Scholars Portal (SP), a service of the Ontario Council of University
Libraries (OCUL), provides the technological infrastructure that preserves and allows access
to information resources collected and shared by Ontario’s 21 university libraries.
The Scholars Portal books platform is designed to provide a single interface for accessing
digital texts from the world’s most important scholarly publishers and public domain books
that have been scanned and digitized for online reading and downloading. Our PDF-based reading
interface offers multiple-page view options, including a grid view to help users easily
navigate in and among books. User accounts allow users to save searches, bookmarks and notes,
as well as to cut and paste small sections of text. The service runs on ebrary ISIS and
MarkLogic technology; MarkLogic is a special-purpose, document-centered database management
system that uses the XML data model and the XQuery query language and is optimized for large
collections of both semi-structured and unstructured information.
As Ebooks become more popular and reading devices multiply, publishers are
moving from PDF to XML formats. This paper describes the challenges we encountered during a pilot
local loading of XML Ebooks and the potential uses of the XML format versus PDF. Before
proceeding, we should define “XML Ebooks”: the term refers to the many different ways in which the
content of Ebooks is structured in markup. Currently, there is no standard for Ebooks that
publishers follow; as a result, when the Scholars Portal receives content from more than one
publisher, we have to solve the unique problems of each Ebook collection separately. As well,
we cannot apply the same loader—the program we write to put Ebooks on our platform—to
2. Local Loading Explained
The various university libraries in Ontario have built a service that will house and
archive content so that future generations of scholars may continue to access the same content
regardless of changes in subscription policies, which take place so often in the publishing
industry. By “local loading,” we mean the delivery of data files from the publisher/vendor for
presentation via SP software platforms. Local loading of Ebooks has provided Ontario
university libraries with the flexibility and control to create Web-based archival, search and
delivery interfaces that are vendor independent. Furthermore, SP search tools reduce the
complexity of multiple-vendor interfaces and provide our researchers with a common search
interface for millions of journal articles and book chapters. The objectives of SP local
loading are mainly long-term preservation and enhanced discovery.
From the publishers’ point of view, the most important advantage to local loading is
stability. SP is in the process of being recognized as a trusted digital archive by the Center
for Research Libraries (CRL). Moreover, security measures ensure that a publisher’s data will
not be corrupted in the long term and will be incorporated within new technologies as needed.
The SP platform is also designed to ensure appropriate levels of access to authorized users
because subscriptions vary with each member library. SP staff provides technical support that
would otherwise be directed to publishers. Finally, SP search interfaces are heavily used by
Ontario researchers, and SP metadata is indexed by Google; thus, SP can guarantee high
3. Scholars Portal’s Ebooks Platform
Currently, Scholars Portal Books holds more than 350,000 Ebooks, which, until recently,
were received from publishers mostly in PDF format. The local loading procedure for PDF using
ebrary’s ISIS Toolkit comprises several stages. The first stage is preparing the PDF file:
some publishers deliver Ebooks to us as individual chapters, and therefore we need to create a
unified PDF Ebook by merging all those files together. The second stage is finding MARC
records for each book: from the MARC records, we extract the metadata required for loading and
later, also for publishing, so that other libraries can load the records on their OPACs.
During the loading, we modify the MARC to contain a link to our platform (the 856 field). At
the third stage, we generate a MODS file from the MARC record. Research and bibliographic
management tools, such as Zotero, use MODS files.
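The third stage can be pictured with a minimal sketch, assuming the title, ISBN and platform link have already been extracted from the MARC record. The function name `build_mods` and the sample values are illustrative and are not taken from the Scholars Portal loader; only the MODS namespace and element names come from the MODS schema itself.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def build_mods(title, isbn, url):
    """Build a minimal MODS record from values already pulled out of MARC.

    Hypothetical helper: a production record would carry many more fields.
    """
    ET.register_namespace("mods", MODS_NS)
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    identifier = ET.SubElement(mods, f"{{{MODS_NS}}}identifier", {"type": "isbn"})
    identifier.text = isbn
    # The platform link corresponds to what is written into the MARC 856 field.
    location = ET.SubElement(mods, f"{{{MODS_NS}}}location")
    ET.SubElement(location, f"{{{MODS_NS}}}url").text = url
    return ET.tostring(mods, encoding="unicode")


# Placeholder values, not a real catalogue record.
record = build_mods(
    "An Example Title",
    "9780000000002",
    "http://example.org/viewdoc.html?id=1",
)
print(record)
```

A library such as a dedicated MARC parser would normally supply the input values; that step is left out here to keep the sketch self-contained.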
Once we have matched the PDF with its MARC record, we submit the PDF and a set of
key-value metadata to ISIS through a service API. Each book on the Ebook platform has a
book ID that defines its URL (http://books2.scholarsportal.info/viewdoc.html?id=371372) and
a permalink that contains one of the ISBN numbers from its MARC record
4. Ebooks Format Change: PDF to XML
The workflow involved in loading XML books differs from that for PDF books, but
still shares some common practices: (1) Matching: pairing each book with a MARC file; (2)
Generating a PDF file by extracting each book’s text, and feeding the PDF, along with the
metadata from the MARC file, into our ISIS platform (this is a workaround to make the books
searchable on our PDF-based platform); (3) Loading the XML books into MarkLogic (after
cleanup). In order to display the books on our platform, we use a collection-specific XSLT
stylesheet to convert the book into HTML. Consequently, each time someone tries to read an
XML book on our platform, instead of calling the rasterization service (which we normally use
for PDF content), an XML service is triggered that fetches the requested section of the
book from MarkLogic and transforms it into HTML (server side) before serving it to
a. Processing Instructions in the Markup
Our pilot loading of XML books included over 500 titles published by Lippincott Williams
& Wilkins (LWW). Although LWW gave us schema files, we were largely forced to ignore
them, since each publisher uses a different schema. The first step was to add the XML files to
our MarkLogic server; however, doing so returned invalid-XML errors, which required our
programmers to fix the processing instructions in the data. For instance, page breaks appeared
as follows: <?PG 155> and then <?PG 156>. Thus one of the tasks was to replace each such
marker with, say, <PG>155</PG>, and so on. It should be noted that even if we write a
program to clean up this type of instruction, we cannot use it for other XML Ebooks. For
instance, Oxford University Press, which we plan on loading next, marks
pages as follows: <?Page pageId="2"?>. The difference in the instructions also demonstrates
the inconsistency in the use of markup for Ebooks.
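The cleanup described above can be sketched as a single regular-expression pass. This is an illustration under stated assumptions, not the program we ran: the pattern is inferred from the two page markers quoted in the text, and the function name `fix_page_markers` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches an unterminated page marker such as "<?PG 155>". The pattern is an
# assumption based only on the two examples quoted in the text.
BROKEN_PAGE_PI = re.compile(r"<\?PG\s+(\d+)\s*\??>")


def fix_page_markers(xml_text):
    """Replace each malformed page marker with a <PG> element."""
    return BROKEN_PAGE_PI.sub(r"<PG>\1</PG>", xml_text)


sample = "<para>end of page<?PG 155> start of next<?PG 156></para>"
cleaned = fix_page_markers(sample)

# The cleaned text now parses as XML, which the original did not.
root = ET.fromstring(cleaned)
pages = [pg.text for pg in root.iter("PG")]
print(cleaned)
print(pages)
```

A loader for the Oxford University Press collection would need a different pattern, which is exactly the reuse problem noted above.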
b. Generating the Table of Contents (TOC) and Thumbnails
Generating the TOC and thumbnails is a rather long process and must be done asynchronously.
After loading each batch of books, we used ActiveMQ to create and run tasks for each process.
In order to build a TOC, we used each chapter’s ID to provide anchor points for linking. With
the Lippincott Williams & Wilkins collection, however, we ran into a problem with chapter
hierarchy. The LWW Ebooks’ organizational structure includes sections nested within other
sections. For instance, an ID such as “B00139907-DA1-DB1-C1” was not usable for us, since the
machine could not tell whether C1 was a chapter stored under DB1 or the first chapter of the
book.
This screenshot shows a typical TOC structure for the LWW books.
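One way to picture the hierarchy problem is to key each TOC entry on its full ID path rather than on its last segment. The sketch below assumes that the hyphen-separated parts of an ID name the book first and then each enclosing section; that reading of the ID format, the function name `build_toc` and the ID "B00139907-C1" are assumptions made for illustration.

```python
def build_toc(ids):
    """Nest TOC entries by their full hyphen-separated path.

    Assumes the first segment is the book identifier and every later
    segment names one level of nesting.
    """
    root = {}
    for full_id in ids:
        node = root
        for part in full_id.split("-")[1:]:
            node = node.setdefault(part, {})
    return root


ids = [
    "B00139907-C1",          # hypothetical: a top-level first chapter
    "B00139907-DA1-DB1-C1",  # the nested chapter quoted in the text
    "B00139907-DA1-DB1-C2",  # hypothetical sibling
]
toc = build_toc(ids)
print(toc)
```

With this structure the two entries named C1 stay distinct, because each is reached through a different path from the root.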
c. Structuring the Content via XSLT
Once the XML was cleaned and successfully loaded onto MarkLogic, we had to face problems
associated with the XSLT transformation. The XSLT file that came with the LWW Ebooks was
created by Ovid Technologies. Although it is a legitimate XSLT, it made for a poor reading
experience when used as is, thus necessitating a major adaptation to fit our
web-based reader. Among the prominent problems of this XSLT model is the size of tables. Their
size always exceeds the viewport, thus making it very difficult for users to see the text or
to navigate back to the page. Resizing the tables and adding frames to tables and images
solved this problem. For instance: <div style="height: 100px; overflow: scroll">
</div> The following screenshot illustrates the problem; it shows a page before
we applied the changes to the XSLT.
Figure 2: Structuring Tables for the browser default behavior
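The table fix was made in the XSLT itself. As a language-neutral illustration of the same idea, the following sketch wraps every table in an already-transformed XHTML fragment in a scrollable container. The function name `wrap_tables` and the 400px height are assumptions, not values from our stylesheet.

```python
import xml.etree.ElementTree as ET


def wrap_tables(xhtml, max_height="400px"):
    """Wrap each <table> in a scrollable <div> so it cannot overflow the page."""
    root = ET.fromstring(xhtml)
    # Collect (parent, table) pairs first so the tree is not changed mid-walk.
    pairs = [
        (parent, child)
        for parent in root.iter()
        for child in list(parent)
        if child.tag == "table"
    ]
    for parent, table in pairs:
        index = list(parent).index(table)
        wrapper = ET.Element(
            "div", {"style": f"max-height: {max_height}; overflow: auto"}
        )
        # Keep any text that followed the table in its original position.
        wrapper.tail = table.tail
        table.tail = None
        parent.remove(table)
        wrapper.append(table)
        parent.insert(index, wrapper)
    return ET.tostring(root, encoding="unicode")


page = "<body><p>Intro</p><table><tr><td>1</td></tr></table><p>After</p></body>"
wrapped = wrap_tables(page)
print(wrapped)
```

Using `overflow: auto` rather than `scroll` shows scrollbars only when the table is actually too large; either choice keeps the surrounding text reachable.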
Another major problem has to do with references and footnote linking. If users click on
the footnote marker to return to the text, they will not be taken back to the same place on
the page; or if this reference also appears elsewhere in the text, it will, by default, take
them to where it first appeared. In some instances, references are also not linked. This
structure is highly challenging for a reader’s orientation in that it forces the person to go
back and forth while reading. A possible solution might be to highlight the reference number
both on the way to the note and back in the body text. Since this solution involves a great
amount of coding, we are still investigating alternative solutions. The challenge is that
solutions need to be discreet and not depend on redesigning the book, which would involve
copyright issues, a subject that is beyond the scope of this paper.
Additionally, we were forced to turn off some of the functionalities that arrived in LWW’s
XSLT sheet. For instance, for cross-chapter links, LWW has included CGI calls in its
stylesheets; as a result, the XSLT file cannot readily be reused in other platforms without
modification. We therefore changed some calls to local CGI and ignored others. At the same
time, we were able to turn on the in-chapter links. Another major problem was the display of
images. This required us to write an image controller that finds images based on their IDs and
serves them to the browser. We also had to modify the XSLT to correct the address of the
images to point to our controller (handler).
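The address correction can be sketched as follows, again in Python rather than XSLT. The controller path `/images`, the query parameter names and the assumption that each image element carries its ID in an `id` attribute are all hypothetical; the LWW markup and our controller may differ.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

CONTROLLER_URL = "/images"  # hypothetical endpoint for the image controller


def point_images_at_controller(xhtml, book_id):
    """Rewrite each image address so the browser asks the controller for it."""
    root = ET.fromstring(xhtml)
    for img in root.iter("img"):
        image_id = img.get("id")
        if image_id:
            query = urlencode({"book": book_id, "image": image_id})
            img.set("src", f"{CONTROLLER_URL}?{query}")
    return ET.tostring(root, encoding="unicode")


html = '<div><img id="F1-2" src="graphics/F1-2.jpg" /></div>'
rewritten = point_images_at_controller(html, "B00139907")
new_src = ET.fromstring(rewritten).find("img").get("src")
print(new_src)
```

Images without an ID are left untouched, so a missing identifier degrades to the publisher's original address instead of a broken link to the controller.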
d. Scholars Portal’s Reader: PDF versus XML
The SP reader originally developed for the platform was based on ebrary’s model of
converting PDF vector information into a raster format. We use the ebrary rasterization
engine, but opted to develop our own interface in order to create a more seamless user
experience. We also provide a number of enhancements that were lacking in ebrary’s web reader
at the time. These include a two-page view and a grid view for augmented navigation.
Additional features that we added include text clipping from the rasterized page, exporting a
limited number of pages to PDF and the enabling of text searching. Recently we implemented
user accounts that allow users to save and sort books, bookmark pages, highlight text and
write page notes.
With XML Ebooks, the most significant difference is that we no longer need to provide a
“reader,” because the browser itself becomes the reader: what was once the static image of a
printed page is now an HTML page. This also means that traditional reading tools, such as page
magnification, bookmarking, multiple page views, text clipping and PDF page exporting, are no
longer relevant features to worry about, because most of these tools are now included in the
default behavior of most web browsers. Our initial version of the HTML page for XML Ebooks
offered users the ability to view a table of contents, to search text across chapters, to link
to endnotes or bibliographies and to write chapter notes. For instance, Figure 3 shows
how we are planning to facilitate chapter-based annotation.
Since MarkLogic indexes all XML data, we took advantage of its Search API, using the
namespace module to perform full-text searching. Up to now, we have dealt with only one type
of XML book. However, after viewing other collections that we plan to load, it became clear
that the structuring of sections or chapters varies from collection to collection, and that we
will have to write a custom, collection-specific XQuery in order to enable searches
within each section or chapter. The presentation of relevance scoring is also under
construction. Figure 4 shows the search results using the MarkLogic Search API. In this example
we used “chapters” as the definition of which parts of a document should be searched, based on
our indexes in XQuery. Once a user enters a search term, the results open up at the bottom
like a drawer and are listed with an excerpt of the text surrounding the search term, which
is highlighted. A user can then click on a result and be taken to that section of the text.
Figure 3: Reader's Functionalities
Figure 4: Search Results Presentation
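The behaviour of the results drawer (one hit per chapter, with a highlighted excerpt) can be illustrated in plain Python. This sketch stands in for the MarkLogic Search API and does not use it; the function name `search_chapters`, the `<mark>` wrapper and the sample chapter texts are assumptions.

```python
import re


def search_chapters(chapters, term, context=30):
    """Return (chapter_id, excerpt) pairs with the first hit wrapped in <mark>."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    results = []
    for chapter_id, text in chapters.items():
        match = pattern.search(text)
        if match is None:
            continue
        # Keep up to `context` characters on each side of the hit.
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(text))
        excerpt = (
            text[start:match.start()]
            + "<mark>" + match.group(0) + "</mark>"
            + text[match.end():end]
        )
        results.append((chapter_id, excerpt))
    return results


# Made-up chapter texts keyed by a chapter ID.
chapters = {
    "C1": "The heart pumps blood through the body.",
    "C2": "Bones support the body and protect organs.",
}
hits = search_chapters(chapters, "heart")
print(hits)
```

Each chapter ID in the result doubles as the link target, which is how a click on a result can take the user to that section of the text.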
e. In the Land of EPUB3
EPUB is a specification intended to provide a standard way in which to interchange and
deliver reflowable content to reading systems. If EPUB3, the new generation of the EPUB
specification recently issued by the International Digital Publishing Forum (IDPF), is adopted
rapidly by publishers, it will certainly give us more control over knowing what we
are receiving from the publishers. This development, in turn, may bring back the comfort
associated with receiving PDFs from publishers, however limited a format that may be. Yet,
unlike the PDF format, EPUB supports "media queries" that enable the layout design
to adapt easily to the files' new "home," be it a laptop or an iPhone. When treating Ebooks
that contain many images, as the LWW books do, EPUB3 will hopefully relieve us of the
need to take control of the graphic design. It will be left to EPUB, because of its vertical
writing capacity, to help images to adapt to the size and resolution of their rendering
Also, we will need to write an EPUB reader for MarkLogic, and hopefully, since EPUB3
accommodates more extensive metadata at all levels, even to the level of the single paragraph,
we will also be able to take advantage of EPUB3’s support for new semantic markup features.
The local loading of XML Ebooks has created new challenges and opportunities previously
unavailable for Ebooks in PDF formats. As for processing instructions for XML data, we seldom
found any; and when we did, they were rarely complete. We attribute the minimal and
often incomplete existence of processing instructions for XML files to the fact that local
loading is not yet common practice. Thus, publishers still do not think about their files as
needing processing by others or as having to perform in other environments. In most cases, the
libraries simply point to the publisher’s website.
While problems with enriched metadata, such as TOC, can be challenging for all types and
formats of Ebooks, the metadata for XML Ebooks can be found either in the XSLT sheets or in
the XML schema (for instance, breaking up the book into pages). There are, then, various
strategies for creating administrative metadata, all with their own advantages and drawbacks.
Because of copyright concerns, we do not want to touch the data; consequently, modifications
are done mostly at an XSLT level and in accordance with a browser’s default behavior. In the
case of the LWW Ebooks, for instance, we found that a more detailed XML schema might result in
a better presentation of the content.
With no accepted format, such as EPUB, adopted by publishers for Ebooks, the optimization
of XML Ebooks depends on the human resources and coding invested by the hosting platform. In
the case of Scholars Portal Books, the local loading of Ebooks received from various
publishers has taken the technological burden off the publishers’ shoulders; at the same time,
it has given us the freedom to experiment with the best available text-display formats and
practices. That said, cleaning XML and applying extensive modifications to the XSLT sheets
slow down the production line on our platform and create a long lag between the
time the OCUL schools buy the books and the time they become available on our platform. In
addition, since the advantage of MarkLogic is that any type of XML can be stored on it, using
our limited number of software developers to create one format for all the publishers with
which we work does not make sense.
Furthermore, some XSLT modifications may lead to the redesign of a book and, therefore,
possibly cause licensing difficulties for the hosting platform. Thus, in addition to
standardization, it is important for both publishers and libraries to start thinking about new
content management strategies for Ebooks that will include format and presentation models.
Moving beyond PDF readers, XML Ebooks take advantage of web browser functionality and
provide enhanced reading utilities, such as full-text searching. With MarkLogic’s indexing
capabilities, search functionalities in XML Ebooks can be limited to chapters or sub-sections.
As each publisher may present content differently, however, the solutions found for LWW Ebooks
may not be serviceable for other publishers. The Scholars Portal technical team would then be
required to start anew with each collection or publisher. It is little wonder, then, that the
current, still-early state of development of XML Ebooks demands a great deal of resources at the
hosting platform’s end. The success of Ebooks is affected by such factors as the cost of
content and the feel of reading. It is possible that as the publishing industry moves further
into XML Ebooks, new standards for displaying content will be adopted, allowing platforms such
as Scholars Portal to receive Ebooks in a ready-for-production state and to concentrate on
developing reading utilities.